Posterior Sampling for Reinforcement Learning Without Episodes
Authors
Abstract
We consider the problem of learning to optimize an unknown MDP $M = (\mathcal{S}, \mathcal{A}, R, P)$, where $\mathcal{S} = \{1, \dots, S\}$ is the state space and $\mathcal{A} = \{1, \dots, A\}$ is the action space. In each timestep $t = 1, 2, \dots$ the agent observes a state $s_t \in \mathcal{S}$, selects an action $a_t \in \mathcal{A}$, receives a reward $r_t \sim R(s_t, a_t) \in [0, 1]$, and transitions to a new state $s_{t+1} \sim P(s_t, a_t)$. We define all random variables with respect to a probability space $(\Omega, \mathcal{F}, \mathbb{P})$.
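The interaction protocol described in the abstract can be sketched as a short simulation. This is only an illustration under stated assumptions, not the paper's algorithm: the helper names (`make_random_mdp`, `step`, `run`), the Bernoulli reward model, the randomly generated MDP, and the uniform-random placeholder policy are all hypothetical choices made for the example. States and actions are 0-indexed here, whereas the abstract indexes them from 1.

```python
import random


def make_random_mdp(num_states, num_actions, rng):
    """Build a hypothetical tabular MDP (assumption: Bernoulli rewards)."""
    # reward_mean[s][a] is the mean of a Bernoulli reward, so rewards lie in [0, 1].
    reward_mean = [[rng.random() for _ in range(num_actions)]
                   for _ in range(num_states)]
    # transition[s][a] is a probability vector over next states.
    transition = []
    for _ in range(num_states):
        rows = []
        for _ in range(num_actions):
            weights = [rng.random() + 1e-9 for _ in range(num_states)]
            total = sum(weights)
            rows.append([w / total for w in weights])
        transition.append(rows)
    return reward_mean, transition


def step(state, action, reward_mean, transition, rng):
    """Sample r_t ~ R(s_t, a_t) and s_{t+1} ~ P(s_t, a_t)."""
    reward = 1.0 if rng.random() < reward_mean[state][action] else 0.0
    num_states = len(transition)
    next_state = rng.choices(range(num_states),
                             weights=transition[state][action])[0]
    return reward, next_state


def run(num_steps, num_states=3, num_actions=2, seed=0):
    """Run one continuing (non-episodic) stream of interaction."""
    rng = random.Random(seed)
    reward_mean, transition = make_random_mdp(num_states, num_actions, rng)
    state = 0
    history = []
    for t in range(1, num_steps + 1):
        # Placeholder policy (assumption): choose an action uniformly at random.
        action = rng.randrange(num_actions)
        reward, next_state = step(state, action, reward_mean, transition, rng)
        history.append((t, state, action, reward, next_state))
        state = next_state  # no reset: the process has no episodes
    return history


if __name__ == "__main__":
    for record in run(5):
        print(record)
```

Each record is a tuple `(t, s_t, a_t, r_t, s_{t+1})`. The state is never reset, which matches the non-episodic setting named in the title; a learning agent would replace the random action choice with its own policy.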
Similar resources
Pure Exploration in Episodic Fixed-Horizon Markov Decision Processes
Multi-Armed Bandit (MAB) problems can be naturally extended to Markov Decision Processes (MDP). We extend the Best Arm Identification problem to episodic fixed-horizon MDPs. Here, the goal of an agent interacting with the MDP is to reach a high confidence on the optimal policy in as few episodes as possible. We propose Posterior Sampling for Pure Exploration (PSPE), a Bayesian algorithm for pur...
(More) Efficient Reinforcement Learning via Posterior Sampling
Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Mark...
Bootstrapped Thompson Sampling and Deep Exploration
This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions. The approach is based on a bootstrap technique that uses a combination of observed and artificially generated data. The latter serves to induce a prior distribution which, as we will demonstrate, is critic...
Efficient Reinforcement Learning via Initial Pure Exploration
In several realistic situations, an interactive learning agent can practice and refine its strategy before going on to be evaluated. For instance, consider a student preparing for a series of tests. She would typically take a few practice tests to know which areas she needs to improve upon. Based on the scores she obtains in these practice tests, she would formulate a strategy for maximizing he...
Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling
Recent advances in deep reinforcement learning have made significant strides in performance on applications such as Go and Atari games. However, developing practical methods to balance exploration and exploitation in complex domains remains largely unsolved. Thompson Sampling and its extension to reinforcement learning provide an elegant approach to exploration that only requires access to post...
Journal title:
- CoRR
Volume: abs/1608.02731
Issue: -
Pages: -
Publication date: 2016